Phase-Transition in Binary Sequences with Long-Range Correlations
Motivated by novel results in the theory of correlated sequences, we analyze the dynamics of
random walks with long-term memory (binary chains with long-range correlations). In our model,
the probability for a unit bit in a binary string depends on the fraction of unities preceding it.
We show that the system undergoes a dynamical phase-transition from normal diffusion, in which
the variance $D_L$ scales as the string's length $L$, into a super-diffusion phase ($D_L \sim L^{1+|\alpha|}$), when
the correlation strength exceeds a critical value. We demonstrate the generality of our results with
respect to alternative models, and discuss their applicability to various data, such as coarse-grained
DNA sequences, written texts, and financial data.



Dynamical systems with long-range spatial (and/or 
temporal) correlations are attracting considerable inter- 
est across many disciplines. They are identified in phys- 
ical, biological, social, and economic sciences (see e.g., 
[1-6] and references therein). Of particular interest are 
situations in which the system can be mapped onto a 
mathematical object, such as a correlated sequence of 
symbols, preserving the essential statistical properties of 
the original system. 

One of the methods most frequently used to obtain in- 
sight into the nature of correlations in a dynamical sys- 
tem consists of mapping the space of states onto two 
symbols [5]. Thus, the problem is reduced to the explo- 
ration of the statistical properties of correlated binary 
chains. This can also be viewed as the analysis of a 
history-dependent random walk. The random walk is one
of the most ubiquitous concepts of statistical physics, with
applications in numerous scientific fields (see e.g.,
[7-13] and references therein).

It is well established that the statistical properties 
of coarse-grained DNA strings and written texts signif- 
icantly deviate from those of purely random sequences 
[2,14]. Financial data (such as stock market quotes) are
similarly far from being purely diffusive. Moreover, these
systems exhibit "super-diffusive" behavior in the sense
that the variance $D(L)$ grows asymptotically faster than
$L$ (where $L$ is the length of the considered text). Specifically,
$D \sim L^{\alpha}$, with $\alpha > 1$ [5]. Such a remarkable (and
essentially universal) phenomenon can be attributed to
long-range positive correlations. Systems with such correlations
may be anticipated to exhibit a dynamical phase
transition (from normal to super-diffusive behavior) at
some critical correlation strength.

Thus, the problem of random walk where the jumping 
probabilities are history-dependent is of great interest for 
understanding the behavior of systems with long-range 
correlations, such as DNA strings, written texts, and financial
data. The aim of the present Letter is to analyze
this problem, and to provide a simple yet generic analytical
description of the statistical properties of these
systems.

We begin by solving a simple model which incorpo- 
rates long-range correlations into an otherwise random 
sequence. We consider a discrete binary string of symbols,
$a_i \in \{0, 1\}$, in which the conditional probability of
a given symbol (say, a unit bit) occurring at the position
$i + 1$ is history-dependent, and given by

$$p(k, L) = \frac{1}{2}\left(1 - \mu\,\frac{L - 2k}{L + L_0}\right), \tag{1}$$



where $k$ is the number of such symbols (unities) appearing
in the preceding $L$ bits. The correlation parameter
$\mu$, where $-1 < \mu < 1$, determines the strength of correlations
in the system. The persistence condition $\mu > 0$
implies that a given symbol in the preceding sequence
promotes the birth of a new identical symbol. On the
other hand, in the anti-persistence region $\mu < 0$, each
symbol inhibits the appearance of a new identical symbol.
The parameter $L_0 > 0$ is a constant transient time.
For $L \ll L_0$ the sequence is approximately random (uncorrelated),
whereas for $L \gg L_0$ the effect of correlations
takes over [15].

In this model, the conditional probability $p(k, L; \mu, L_0)$
depends on the fraction of unities (or zeroes) in the preceding
bits, and is independent of their arrangement.
This allows one to obtain an analytical description of the 
system's dynamical behavior. As we shall demonstrate 
below, this simple model provides a good quantitative 
description of the observed statistical properties of vari- 
ous natural systems, such as coarse-grained DNA strings, 
written texts, and financial data. 
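
The update rule can be sketched in a few lines of Python. This is a minimal illustration, assuming the conditional probability has the linear form $p(k, L) = \frac{1}{2}[1 - \mu (L - 2k)/(L + L_0)]$, which is our reading of Eq. (1); the function names `p_unit` and `generate_chain` are hypothetical and not taken from the original work.

```python
import random


def p_unit(k, L, mu, L0):
    """Probability that the next bit is 1, given k ones among the preceding L bits.

    Assumed linear form of Eq. (1): p = (1/2) * [1 - mu * (L - 2k) / (L + L0)].
    """
    return 0.5 * (1.0 - mu * (L - 2 * k) / (L + L0))


def generate_chain(n_bits, mu, L0, rng=None):
    """Generate one history-dependent binary chain of length n_bits."""
    rng = rng or random.Random()
    bits = []
    k = 0  # number of ones generated so far
    for L in range(n_bits):
        bit = 1 if rng.random() < p_unit(k, L, mu, L0) else 0
        bits.append(bit)
        k += bit
    return bits


if __name__ == "__main__":
    # Short persistent chain (mu > 0) with the transient length used in Fig. 1.
    chain = generate_chain(40, mu=0.8, L0=100, rng=random.Random(0))
    print("".join(map(str, chain)))
```

Because $|L - 2k| \le L < L + L_0$ and $|\mu| < 1$, the value returned by `p_unit` always lies strictly between 0 and 1, so no clipping is needed.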

The probability $P(k, L + 1)$ of finding $k$ identical symbols
(say, unities) in a sequence of length $L + 1$ follows
the evolution equation

$$P(k, L + 1) = [1 - p(k, L)]\,P(k, L) + p(k - 1, L)\,P(k - 1, L). \tag{2}$$
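
The evolution equation can be iterated numerically without any sampling noise. The sketch below is only an illustration: it assumes the linear conditional probability $p(k, L) = \frac{1}{2}[1 - \mu (L - 2k)/(L + L_0)]$ and compares the resulting variance of $x = 2k - L$ with the closed form quoted in Eqs. (5) and (6). The function names are hypothetical.

```python
import math


def variance_master_equation(n_steps, mu, L0):
    """Iterate Eq. (2) exactly; return the variance of x = 2k - L at L = n_steps."""
    P = [1.0]  # P[k] at L = 0: no bits yet, so k = 0 with certainty
    for L in range(n_steps):
        new_P = [0.0] * (L + 2)
        for k, prob in enumerate(P):
            p = 0.5 * (1.0 - mu * (L - 2 * k) / (L + L0))  # assumed form of Eq. (1)
            new_P[k] += (1.0 - p) * prob   # next bit is 0: k unchanged
            new_P[k + 1] += p * prob       # next bit is 1: k -> k + 1
        P = new_P
    L = n_steps
    mean = sum((2 * k - L) * pk for k, pk in enumerate(P))
    second = sum((2 * k - L) ** 2 * pk for k, pk in enumerate(P))
    return second - mean ** 2


def variance_closed_form(L, mu, L0):
    """Continuum variance, Eq. (5), with the mu = 1/2 case of Eq. (6)."""
    e = 1.0 - 2.0 * mu
    log_ratio = math.log((L + L0) / L0)
    if e == 0.0:
        return (L + L0) * log_ratio
    # 1 - (L0 / (L + L0))**e, written with expm1 for accuracy near mu = 1/2
    return -(L + L0) * math.expm1(-e * log_ratio) / e


if __name__ == "__main__":
    for mu in (-0.4, 0.0, 0.3, 0.8):
        exact = variance_master_equation(400, mu, 100)
        approx = variance_closed_form(400, mu, 100)
        print(f"mu={mu:+.1f}  discrete={exact:.1f}  continuum={approx:.1f}")
```

The discrete recursion and the continuum formula are not expected to agree exactly; the difference reflects the continuum approximation and shrinks as $L_0$ grows.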



Crossing to the continuous limit, one obtains the Fokker-Planck
diffusion equation for the correlated process

$$\frac{\partial P}{\partial L} = \frac{1}{2}\frac{\partial^2 P}{\partial x^2} - \frac{\mu}{L + L_0}\frac{\partial (xP)}{\partial x}, \tag{3}$$

where $x = 2k - L$. The evolution equation (3), along with
the initial condition $P(x, L = 0) = \delta(x)$, has a solution in
the form of a Gaussian distribution

$$P(x, L) = \frac{1}{\sqrt{2\pi D(L)}} \exp\left[-\frac{x^2}{2D(L)}\right], \tag{4}$$

where the variance $D(L)$ is given by



$$D(L; \mu, L_0) = \frac{L + L_0}{1 - 2\mu}\left[1 - \left(\frac{L_0}{L + L_0}\right)^{1 - 2\mu}\right]. \tag{5}$$
Equation (5) breaks down in the special case $\mu = 1/2$, in
which case the variance is given by

$$D(L; \tfrac{1}{2}, L_0) = (L + L_0)\ln\left(\frac{L + L_0}{L_0}\right). \tag{6}$$
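
As a consistency check, both expressions for the variance can be recovered from a single moment equation. The derivation below is a sketch under the assumption that the Fokker-Planck equation has the drift-diffusion form $\partial_L P = \frac{1}{2}\partial_x^2 P - \frac{\mu}{L + L_0}\partial_x (xP)$, as the continuum limit of the evolution equation suggests; the intermediate steps are not quoted from the original.

```latex
% Multiply the assumed Fokker-Planck equation by x^2 and integrate over x,
% with D(L) = <x^2> and boundary terms vanishing:
\begin{align}
\frac{dD}{dL} &= 1 + \frac{2\mu}{L + L_0}\,D, \qquad D(0) = 0, \\
% Solve the linear ODE with the integrating factor (L + L_0)^{-2\mu}:
D(L) &= (L + L_0)^{2\mu} \int_{L_0}^{L + L_0} z^{-2\mu}\,dz .
\end{align}
% For \mu \neq 1/2 the integral is a power law, which gives Eq. (5):
\begin{equation}
D(L) = \frac{L + L_0}{1 - 2\mu}
       \left[1 - \left(\frac{L_0}{L + L_0}\right)^{1 - 2\mu}\right].
\end{equation}
% For \mu = 1/2 the integrand is 1/z, which gives the logarithm of Eq. (6):
\begin{equation}
D(L) = (L + L_0)\,\ln\!\left(\frac{L + L_0}{L_0}\right).
\end{equation}
```

Written this way, the logarithm at $\mu = 1/2$ appears as the usual limit of a power-law integral rather than as a separate result.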



Remarkably, one finds that the correlated system undergoes
a dynamical phase transition at the critical correlation
strength $\mu_c = 1/2$. The variance $D(L)$ of the correlated
sequence has three qualitatively different asymptotic
behaviors (in the $L \gg L_0$ limit):

$$D(L) \simeq \begin{cases} (1 - 2\mu)^{-1} L & \mu < \mu_c; \\ L \ln(L/L_0) & \mu = \mu_c; \\ (2\mu - 1)^{-1} L_0^{1 - 2\mu} L^{2\mu} & \mu > \mu_c. \end{cases} \tag{7}$$

Thus, for $\mu < \mu_c$ the asymptotic variance scales linearly
with the string length, whereas for a history-dependent
chain with strong positive correlations ($\mu > \mu_c$) the system
is characterized by a super-diffusion phase, in which
case $D(L)$ grows asymptotically faster than $L$ [16].
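
The three regimes can be made concrete by evaluating the local scaling exponent $d \ln D / d \ln L$ from the closed-form variance. The following is a small illustrative sketch; it assumes the variance of Eqs. (5) and (6) in the form $D = \frac{L + L_0}{1 - 2\mu}[1 - (L_0/(L + L_0))^{1 - 2\mu}]$, and `effective_exponent` is a hypothetical helper name.

```python
import math


def variance_closed_form(L, mu, L0):
    """Continuum variance, Eq. (5), with the mu = 1/2 case of Eq. (6)."""
    e = 1.0 - 2.0 * mu
    log_ratio = math.log((L + L0) / L0)
    if e == 0.0:
        return (L + L0) * log_ratio
    return -(L + L0) * math.expm1(-e * log_ratio) / e


def effective_exponent(L, mu, L0, step=1.01):
    """Local slope d ln D / d ln L, estimated by a central difference in log L."""
    d_hi = variance_closed_form(L * step, mu, L0)
    d_lo = variance_closed_form(L / step, mu, L0)
    return math.log(d_hi / d_lo) / (2.0 * math.log(step))


if __name__ == "__main__":
    L0 = 100.0
    for mu in (-0.8, 0.2, 0.5, 0.8, 0.9):
        slopes = [effective_exponent(L, mu, L0) for L in (1e3, 1e5, 1e8)]
        print(mu, [round(s, 3) for s in slopes])
```

For $\mu < 1/2$ the slope tends to 1, for $\mu > 1/2$ it tends to $2\mu$, and at $\mu = 1/2$ it approaches 1 only logarithmically slowly, as expected from the $L \ln(L/L_0)$ form.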

The analytical model can readily be extended to en- 
compass situations in which the binary sequence is biased. 
Let 



, 1/ L-2k 



(8) 



with $-1 < q < 1$. The distribution $P(x, L)$ corresponding
to this conditional probability is given by a Gaussian
function, centered about the position



= (i) = 



-L 



(9) 



L+Lo 



Thus, the drift velocity asymptotically approaches a
constant value. The variance $D(L)$, unaltered by
the bias, is given by Eqs. (5) and (6).

In order to confirm the analytical results, we perform
numerical simulations of (discrete) binary sequences.
Figure 1 displays the resulting scaled variance $L^{-1}D(L)$
of correlated strings with various values of the
correlation parameter $\mu$. We find excellent agreement
between the analytically predicted results [see Eqs. (5)
and (6)] and the numerical ones.

FIG. 1. The scaled variance $L^{-1}D(L)$ as a function
of the string length $L$. We present results for
$\mu = -0.8, -0.4, 0, 0.2, 0.5, 0.8$, and $0.9$ (from bottom to top),
with $L_0 = 100$. The numerically computed asymptotic slopes
agree with the analytical predictions [see Eqs. (5) and (6)] to
within less than 1%.
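
A comparison of this kind can be reproduced on a small scale with a direct Monte Carlo simulation. The sketch below is not the simulation code behind Fig. 1, whose details are not given; it assumes the linear conditional probability $p(k, L) = \frac{1}{2}[1 - \mu (L - 2k)/(L + L_0)]$, estimates the variance of $x = 2k - L$ over an ensemble of independent chains, and uses hypothetical function names and ensemble sizes.

```python
import math
import random


def simulate_variance(n_chains, n_bits, mu, L0, seed=0):
    """Ensemble variance of x = 2k - L after n_bits steps, from n_chains chains."""
    rng = random.Random(seed)
    xs = []
    for _ in range(n_chains):
        k = 0  # number of ones so far in this chain
        for L in range(n_bits):
            p = 0.5 * (1.0 - mu * (L - 2 * k) / (L + L0))  # assumed form of Eq. (1)
            if rng.random() < p:
                k += 1
        xs.append(2 * k - n_bits)
    mean = sum(xs) / n_chains
    return sum((x - mean) ** 2 for x in xs) / (n_chains - 1)


def variance_closed_form(L, mu, L0):
    """Continuum variance, Eq. (5), with the mu = 1/2 case of Eq. (6)."""
    e = 1.0 - 2.0 * mu
    log_ratio = math.log((L + L0) / L0)
    if e == 0.0:
        return (L + L0) * log_ratio
    return -(L + L0) * math.expm1(-e * log_ratio) / e


if __name__ == "__main__":
    n_bits, L0 = 200, 100
    for mu in (-0.8, 0.0, 0.8):
        sim = simulate_variance(2000, n_bits, mu, L0)
        theory = variance_closed_form(n_bits, mu, L0)
        print(f"mu={mu:+.1f}  simulated D/L={sim / n_bits:.3f}  predicted D/L={theory / n_bits:.3f}")
```

With a few thousand chains the statistical error on the variance is a few percent, so this sketch illustrates the trend but cannot test the sub-1% agreement quoted in the caption.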



Robustness of the linear model. - In order to show the
generality of the model discussed above, we consider situations
in which the (history-dependent) jump probability
is an arbitrary odd function [17] of the fraction $\xi = \frac{x}{L + L_0}$
of unities (zeroes) that appeared in the previous $L$ symbols,

$$p(k, L) = \frac{1}{2}\left[1 + \mu F(\xi)\right]. \tag{10}$$



For asymptotically large $L$, one always finds $\xi \to 0$ for
non-ballistic diffusion, justifying a power-law expansion
of $F(\xi)$. As long as this expansion includes a linear term,
the original differential equation (3) is recovered for large
$L$. We therefore expect the previous analytical results
[Eqs. (5) and (6)] to hold true for generic (non-linear)
models as well. The generality of the model is illustrated
in Fig. 2, in which we depict results for various choices of
the probability function $F(\xi)$. As predicted, the results
are found to agree with the linear model.
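
The robustness argument can be probed numerically by iterating the evolution equation with different choices of $F$. The sketch below rests on several assumptions that are ours rather than the original's: that the jump probability is $p = \frac{1}{2}[1 + \mu F(\xi)]$ with $\xi = (2k - L)/(L + L_0)$, and that the non-linear test functions are $\frac{2}{\pi}\sin(\frac{\pi}{2}\xi)$ and $\tanh(\xi)$, both normalized to share the linear term $\xi$. All names are hypothetical.

```python
import math


def variance_for_F(F, n_steps, mu, L0):
    """Exact variance of x = 2k - L for the jump probability p = (1 + mu * F(xi)) / 2."""
    P = [1.0]  # P[k] at L = 0
    for L in range(n_steps):
        new_P = [0.0] * (L + 2)
        for k, prob in enumerate(P):
            xi = (2 * k - L) / (L + L0)   # assumed definition of the fraction xi
            p = 0.5 * (1.0 + mu * F(xi))  # assumed reading of Eq. (10)
            new_P[k] += (1.0 - p) * prob
            new_P[k + 1] += p * prob
        P = new_P
    L = n_steps
    mean = sum((2 * k - L) * pk for k, pk in enumerate(P))
    second = sum((2 * k - L) ** 2 * pk for k, pk in enumerate(P))
    return second - mean ** 2


# Odd functions with unit linear coefficient; |F| <= 1 keeps p inside [0, 1].
CHOICES = {
    "linear": lambda xi: xi,
    "sine": lambda xi: (2.0 / math.pi) * math.sin(0.5 * math.pi * xi),
    "tanh": math.tanh,
}


if __name__ == "__main__":
    for mu in (-0.8, 0.8):
        for name, F in CHOICES.items():
            D = variance_for_F(F, 300, mu, 100)
            print(f"mu={mu:+.1f}  F={name:6s}  D/L={D / 300:.4f}")
```

Since the typical value of $\xi$ is small at these lengths, the cubic corrections in the sine and tanh forms change the variance only slightly, which is the behavior the text describes.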

Applications. - The robustness of the linear model (see 
Fig. 2) suggests that it may capture the essence of the 
underlying correlations in a diversity of systems in na- 
ture. We therefore examine the use of the results derived 
in the present work as an analytical explanation for the 
observed statistical properties of natural systems, such 
as DNA strings, written texts, and financial data. 

As mentioned, it is well established that these systems
often exhibit a significant deviation from random
sequences [2,14], and are characterized by a "super-diffusive"
behavior in which $D \sim L^{\alpha}$, with $\alpha > 1$ [5].
In such systems, super-diffusion may be attributed to
long-range (positive) correlations. In fact, the analytical
model allows one to determine the correlation strength
of these chains.

FIG. 2. The scaled variance $L^{-1}D(L)$ for three different
forms of the function $F(\xi)$: $\xi$, $\frac{2}{\pi}\sin(\frac{\pi}{2}\xi)$, and $\tanh(\xi)$. We
present results for $\mu = -0.8$ and $\mu = 0.8$, with $L_0 = 100$. The
different curves are almost indistinguishable.

Figure 3 depicts the scaled variance $L^{-1}D(L)$ calculated
from DNA sequences of various organisms, as
a function of the string length $L$. It is of considerable
interest to examine by this method the statistical
properties characterizing the DNA of organisms at
various evolutionary levels: Bacillus subtilis (Bacteria),
Methanosarcina acetivorans (Archaea), and Drosophila
melanogaster (Eukarya) [5,18]. The theoretical model
provides a good description of the empirical data [19],
attributing different correlation strengths $\mu$ to different
organisms, as summarized in Table I.
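
One simple way to extract a correlation strength from a measured variance curve is a one-parameter fit against the closed-form variance. The fitting procedure actually used for Table I is not described here, so the sketch below is only a hedged illustration: it assumes $L_0$ is known, scans $\mu$ on a grid, and minimizes the squared log-residuals against Eq. (5). The function `fit_mu` and its parameters are hypothetical.

```python
import math


def variance_closed_form(L, mu, L0):
    """Continuum variance, Eq. (5), with the mu = 1/2 case of Eq. (6)."""
    e = 1.0 - 2.0 * mu
    log_ratio = math.log((L + L0) / L0)
    if e == 0.0:
        return (L + L0) * log_ratio
    return -(L + L0) * math.expm1(-e * log_ratio) / e


def fit_mu(lengths, variances, L0, grid_step=0.001):
    """Grid search over -1 < mu < 1 minimising squared log-residuals against Eq. (5).

    lengths must be positive so that the model variance is non-zero.
    """
    best_mu, best_cost = None, float("inf")
    n_grid = int(round(2.0 / grid_step))
    for i in range(1, n_grid):
        mu = -1.0 + i * grid_step
        cost = 0.0
        for L, D in zip(lengths, variances):
            cost += (math.log(D) - math.log(variance_closed_form(L, mu, L0))) ** 2
        if cost < best_cost:
            best_mu, best_cost = mu, cost
    return best_mu


if __name__ == "__main__":
    # Synthetic check: data generated from the model itself with mu = 0.7.
    lengths = [10, 100, 1000, 10000]
    data = [variance_closed_form(L, 0.7, 100) for L in lengths]
    print(fit_mu(lengths, data, L0=100))
```

Fitting in logarithmic space weights short and long strings comparably; a fit restricted to large $L$ would instead emphasize the asymptotic slope.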

The super-diffusive behavior, shown in Fig. 3 to persist
across very long sequences, is highly suggestive of
long-range correlations extending over more than one gene
(e.g., ~ 5 X 10"* base-pairs in Drosophila). 

Next, we have applied the results of the analytical 
model to various coarse-grained written texts [2,14,5]. It 
has long been recognized that the corresponding binary 
strings are highly self-correlated. The present analyti- 
cal model enables one to determine quantitatively the 
strength of these inner correlations; see Table I. 
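
The coarse-graining step and the empirical variance can be sketched as follows. The letter and nucleotide mappings are those listed in the caption of Table I; everything else is an assumption made for illustration: non-letter characters are skipped, and $D(L)$ is estimated as the variance of $x = 2k - L$ over all overlapping windows of length $L$, which may differ from the estimator used in the original analysis.

```python
def text_to_bits(text):
    """Map letters a-m -> 0 and n-z -> 1 (case-insensitive); skip other characters."""
    bits = []
    for ch in text.lower():
        if "a" <= ch <= "m":
            bits.append(0)
        elif "n" <= ch <= "z":
            bits.append(1)
    return bits


def dna_to_bits(seq):
    """Map {A, G} -> 0 and {C, T} -> 1; skip any other symbol."""
    table = {"A": 0, "G": 0, "C": 1, "T": 1}
    return [table[base] for base in seq.upper() if base in table]


def windowed_variance(bits, L):
    """Variance of x = 2k - L over all overlapping windows of length L."""
    if L < 1 or L > len(bits):
        raise ValueError("window length must satisfy 1 <= L <= len(bits)")
    prefix = [0]  # prefix[i] = number of ones among the first i bits
    for b in bits:
        prefix.append(prefix[-1] + b)
    xs = [2 * (prefix[i + L] - prefix[i]) - L for i in range(len(bits) - L + 1)]
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)


if __name__ == "__main__":
    sample = "Alice was beginning to get very tired of sitting by her sister on the bank"
    bits = text_to_bits(sample)
    for L in (1, 2, 4, 8):
        print(L, windowed_variance(bits, L) / L)
```

A sample this short only demonstrates the mechanics; a meaningful estimate of the scaled variance requires strings far longer than the largest window.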

In Figure 4 we show the scaled variance of coarse-grained
financial data (daily quotes of the Dow Jones
Industrial Average, and the NASDAQ [20]). We note
that the linear model underestimates the empirical variance
at short time scales. This fact can be traced back to
short-term correlations in the markets. (It is interesting
to note that the DJIA maintains an approximately normal
diffusive behavior for a period of about one month.)
However, this short-term memory is washed out at longer
time scales, in which case the analytical model provides
a good description of the empirical results, as evident
from Fig. 4. The corresponding values of the correlation
parameter $\mu$ are summarized in Table I.

FIG. 3. The scaled variance $L^{-1}D(L)$ as a function of the
string length $L$, for coarse-grained DNA sequences of various
organisms. The mapping and parameters used are given in
Table I. Theoretical results [see Eq. (5)] are represented by
curves.

TABLE I. The correlation strength parameter $\mu$ for various
binary strings. We use the following mappings: $\{A, G\} \to 0$,
$\{C, T\} \to 1$ for DNA sequences [5,18]; (a to m) $\to 0$, (n to z)
$\to 1$ for written texts [5]; and daily fall $\to 0$, daily rise $\to 1$
for stock market quotes [20].

Data Type        String Source                       $\mu$
DNA sequences    Drosophila melanogaster             0.57
                 Methanosarcina acetivorans          0.70
                 Bacillus subtilis                   0.86
Written texts    Alice's Adventures in Wonderland    0.58
                 The Holy Bible in English           0.84
                 Works on computer science           0.88
Stock markets    NASDAQ                              0.39
                 DJIA                                0.76

In summary, in this Letter we have analyzed the dynamics
of random walks with history-dependent jump
probabilities. Our work was motivated not only by the
intrinsic interest in such dynamical processes, but also
by the flurry of activity in the field of long-range correlated
systems, and by some universal statistical features
observed in many different natural systems.

We have broadened the study of binary strings to include
long-range correlations, extending throughout the
length of the chain. Using a simple and exactly solvable
model, we identify a dynamical phase transition, from
normal diffusion [$D(L) \sim L$] to super-diffusive behavior
[$D(L) \sim L^{2\mu}$], taking place as the correlation parameter
$\mu$ exceeds its critical value. We show that in spite of the
simplicity of the model, it is robust, and can easily be
extended to describe various features (such as a biased
history-dependent random walk or sub-diffusion).

FIG. 4. The scaled variance $L^{-1}D(L)$ as a function of the
sequence length $L$, for coarse-grained financial data: DJIA
and NASDAQ daily quotes [20]. The mapping and parameters
used are given in Table I. Theoretical results [see Eq. (5)]
are represented by curves.

Next, we have applied the analytical results of the 
model to various binary strings, extracted from very 
different natural systems, such as coarse-grained DNA 
sequences, written texts, and financial data. We find 
that the model adequately describes the long-term be- 
havior of these systems. Furthermore, the model pro- 
vides a straightforward method to measure the correla- 
tion strength of these systems. Our results can be applied 
to various natural systems, and may shed light on the 
underlying rules governing their dynamics. For example, 
the super-diffusive behavior of DNA sequences (see Fig. 
3) suggests long-range correlations extending across more 
than one gene. The model attributes different correlation 
strengths to different organisms. 